The following libraries will be used for this assignment.

suppressWarnings(library(ggplot2))
## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
library(plotly)
## Warning: package 'plotly' was built under R version 3.6.1
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(dbscan)
## Warning: package 'dbscan' was built under R version 3.6.1
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(class)
## Warning: package 'class' was built under R version 3.6.1
library(ROCR)
## Warning: package 'ROCR' was built under R version 3.6.1
## Loading required package: gplots
## Warning: package 'gplots' was built under R version 3.6.1
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess

Introduction

This assignment involves a forgery detection problem. Specifically, we will work with the dataset Stamps , originally reported in Micenková, van Beusekom, and Shafait (2015), and available from the outlier data repository described in Campos et al. (2016). In particular, we use the original version of this dataset (not normalised, without duplicates), which contains 340 observations described by 9 numerical predictors. Each observation (row) is a feature-vector description of a stamp, with 9 features (columns 1 to 9). Originally, this is a binary classification dataset containing forged (photocopied and printed) stamps as well as genuine (ink) stamps. The last (10th) column of the dataset contains the class labels (‘yes’ denotes forged, ‘no’ denotes genuine).

There are 309 genuine and 31 forged stamps. We will try to identify forged stamps (9.12%, presumably outliers) from genuine stamps (90.88%, presumably inliers), both in an unsupervised and in a supervised way. Class labels will be used for supervised learning only, in Activity 3 (supervised anomaly detection) of this assignment. In Activity 1 (PCA) and Activity 2 (unsupervised outlier detection), which are unsupervised, class labels will only be used for visualisation and external assessment of the results; learning will make use of the 9 numerical features only. The features are based on colour and printing properties of the stamps. The following code reads the data into memory and separates the 9 predictors ( PB_Predictors ) from the class labels ( PB_class ):

stamps <- read.table("Stamps_withoutdupl_09.csv", header=FALSE, sep=",", dec=".")
summary(stamps) # 9 Predictors (V1 to V9) and class labels (V10)
##        V1                V2               V3                V4        
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.04883   1st Qu.:0.3357   1st Qu.:0.04662   1st Qu.:0.2369  
##  Median :0.07872   Median :0.4252   Median :0.09584   Median :0.3456  
##  Mean   :0.10265   Mean   :0.4197   Mean   :0.14193   Mean   :0.3841  
##  3rd Qu.:0.12214   3rd Qu.:0.4416   3rd Qu.:0.19458   3rd Qu.:0.5233  
##  Max.   :1.00000   Max.   :0.8645   Max.   :1.00000   Max.   :1.0000  
##        V5               V6                V7               V8          
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.000656  
##  1st Qu.:0.4524   1st Qu.:0.01010   1st Qu.:0.9183   1st Qu.:0.018221  
##  Median :0.5783   Median :0.01945   Median :0.9724   Median :0.032162  
##  Mean   :0.5986   Mean   :0.04151   Mean   :0.9299   Mean   :0.054981  
##  3rd Qu.:0.7373   3rd Qu.:0.04528   3rd Qu.:0.9884   3rd Qu.:0.059360  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :1.000000  
##        V9          V10     
##  Min.   :0.0000   no :309  
##  1st Qu.:0.4492   yes: 31  
##  Median :0.5787            
##  Mean   :0.5693            
##  3rd Qu.:0.7247            
##  Max.   :1.0000
PB_Predictors <- stamps[,1:9] # 9 Predictors (V1 to V9)
PB_class <- stamps[,10] # Class labels (V10)
PB_class <- ifelse(PB_class == 'no',0,1) # Inliers (class "no") = 0, Outliers (class "yes") = 1

Activity 1: Principal Component Analysis

1. Perform Principal Component Analysis (PCA) on the Stamps data in the 9-dimensional space of the numerical predictors ( PB_Predictors ), and show the Proportion of Variance Explained (PVE) for each of the nine resulting principal components. Plot the accumulated sum of PVE for the first m components, as a function of m, and discuss the result: (a) How many components do we need to explain 90% or more of the total variance? (b) How much of the total variance is explained by the first three components?

PCA <- prcomp(PB_Predictors, scale = TRUE)
PCA$rotation
##           PC1         PC2        PC3         PC4         PC5         PC6
## V1 -0.3570003 -0.10417655  0.1813733 -0.01930228  0.68124006  0.44187577
## V2  0.1420765 -0.30721615  0.1299597  0.92287473  0.06310383 -0.10282233
## V3 -0.3317113 -0.04657823 -0.4767126  0.17893590 -0.46295498  0.39982667
## V4 -0.4230165 -0.22831476 -0.3981637  0.02135029 -0.06765320 -0.08319882
## V5 -0.3853972 -0.33542901 -0.1876273 -0.07414299  0.26797160 -0.51370341
## V6 -0.3966169  0.39119428  0.1187845  0.11471184  0.04270746 -0.39846576
## V7  0.2747583 -0.64007315 -0.1336053 -0.23489388  0.10043151  0.09738610
## V8  0.1884198  0.40303353 -0.5625883  0.19360251  0.45311956  0.20043999
## V9  0.3828311  0.05284966 -0.4251531 -0.06436384  0.15488411 -0.39378111
##            PC7          PC8         PC9
## V1 -0.39311729  0.109611082  0.05031087
## V2 -0.03739759 -0.006920782  0.02905519
## V3 -0.31137242  0.026120489 -0.39585520
## V4  0.12867762 -0.063683586  0.76014917
## V5  0.26148066  0.279863425 -0.46460688
## V6 -0.26188791 -0.651400166 -0.09427161
## V7 -0.04026264 -0.635461942 -0.12955785
## V8  0.40078359 -0.181465084 -0.09499376
## V9 -0.65676748  0.209314272  0.10742798
# PVE - Proportion of the Variance Explained (PVE) 

PVE <- (PCA$sdev^2)/sum(PCA$sdev^2)

# plot PVE

PVE.df <- data.frame(PVE)
PVE.df <- cbind(PVE.df, c("PC1","PC2","PC3","PC4","PC5","PC6","PC7","PC8","PC9"))
colnames(PVE.df) <- c("PVE", "PC")
PVE.df$PC <- as.factor(PVE.df$PC)
PVE.plot <- ggplot(data=PVE.df, aes(x=PC, y= PVE)) +
  geom_bar(stat="identity", fill = "#375094") +
  labs(title = "Proportion of the Variance Explained (PVE)", x = "Principal Component", y = "PVE") +
  geom_text(aes(label=sprintf("%0.3f", round(PVE.df$PVE, digits = 4))), position=position_dodge(width=0.9), vjust=-0.25) +
  theme_minimal()
PVE.plot

# Plot Cumulative sum of PVE

plot(cumsum(PVE.df$PVE), main = "Cumulative Sum of PVE", 
     type = "b", ylab = "Cumulative Sum of PVE",
     xlab = "Principal Component")

(a) How many components do we need to explain 90% or more of the total variance?

Six components are needed to explain 90% or more of the total variance.

# PC PVE value

PC1 <- 0.399
PC2 <- 0.152
PC3 <- 0.138
PC4 <- 0.100
PC5 <- 0.078
PC6 <- 0.063

# Sum

PC1+PC2+PC3+PC4+PC5+PC6
## [1] 0.93

(b) How much of the total variance is explained by the first three components?

# sum first 3 PC

PC1+PC2+PC3
## [1] 0.689
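Both answers can also be read off programmatically from the cumulative PVE, rather than summing rounded values by hand. A minimal, self-contained sketch (demonstrated on R's built-in USArrests data, since the Stamps PVE vector lives in the chunk above):

```r
# Given any prcomp fit, compute the PVE and read off both answers
# from the cumulative sum (illustrated on R's built-in USArrests data).
pca <- prcomp(USArrests, scale. = TRUE)
pve <- pca$sdev^2 / sum(pca$sdev^2)
cum_pve <- cumsum(pve)

n_for_90 <- which(cum_pve >= 0.90)[1]  # (a) components needed to reach 90%
first_three <- cum_pve[3]              # (b) variance explained by first 3 PCs
n_for_90
first_three
```

Applying the same two lines to the PVE vector computed above for the Stamps data should reproduce the answers of six components and roughly 0.69.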

2. Do some research by yourself on how to render 3D plots in R, and then plot a 3D scatter-plot of the Stamps data as represented by the first three principal components computed in the previous item ( x = PC1 , y = PC2 , and z = PC3 ). You can use, for example, the function scatter3D() from the package plot3D . Use the class labels ( PB_class ) to plot inliers and outliers in different colours (for example, inliers in black and outliers in red). Make sure you produce multiple plots from different angles (at least three). Recalling that the class labels would not be available in a practical application of unsupervised outlier detection, do the outliers (forged stamps) look easy to detect in an unsupervised way, assuming that the 3D visualisation of the data via PCA is a reasonable representation of the data in full space? How about in a supervised way? Why? Justify your answers.

# data manipulation for 3D plotting

PCA.subset <- cbind(PCA$x[,1], PCA$x[,2], PCA$x[,3])
PCA.subset <- data.frame(PCA.subset)
colnames(PCA.subset) <- c("PC1", "PC2", "PC3")
PB_class[which(PB_class == 0)] <- "Inlier"
PB_class[which(PB_class == 1)] <- "Outlier"
PB_class <- as.factor(PB_class)

# set scene for different camera angles

scene1 = list(camera = list(eye = list(x = 1.45, y = 1.45, z = 1.45)))
scene2 = list(camera = list(eye = list(x = -0.75, y = 2.40, z = 0.30)))
scene3 = list(camera = list(eye = list(x = -2.40, y = 0, z = 0.75)))
scene4 = list(camera = list(eye = list(x = 0, y = -2.40, z = 0.75)))


# intentionally do not use a loop so that the plots can be displayed in different tabs in the RMD file


# 3D plot 1

plot.3d.stamps1 <- plot_ly(PCA.subset, x = ~PC1, y = ~PC2, z = ~PC3, color = PB_class, colors = c('black', 'red'), 
                           marker = list(size = 4)) %>%
                           add_markers() %>%
                           layout(scene = scene1,
                                  title = "Angle 1")

# 3D plot 2

plot.3d.stamps2 <- plot_ly(PCA.subset, x = ~PC1, y = ~PC2, z = ~PC3, color = PB_class, colors = c('black', 'red'), 
                           marker = list(size = 4)) %>%
                           add_markers() %>%
                           layout(scene = scene2,
                                  title = "Angle 2")

# 3D plot 3

plot.3d.stamps3 <- plot_ly(PCA.subset, x = ~PC1, y = ~PC2, z = ~PC3, color = PB_class, colors = c('black', 'red'), 
                           marker = list(size = 4)) %>%
                           add_markers() %>%
                           layout(scene = scene3,
                                  title = "Angle 3")

# 3D plot 4

plot.3d.stamps4 <- plot_ly(PCA.subset, x = ~PC1, y = ~PC2, z = ~PC3, color = PB_class, colors = c('black', 'red'), 
                           marker = list(size = 4)) %>%
                           add_markers() %>%
                           layout(scene = scene4,
                                  title = "Angle 4")

3D plots for PC1 to PC3

3D plot 1

plot.3d.stamps1

3D plot 2

plot.3d.stamps2

3D plot 3

plot.3d.stamps3 

3D plot 4

plot.3d.stamps4 

Do the outliers (forged stamps) look easy to detect in an unsupervised way?

Without the given class labels, identifying these observations as outliers would require a more detailed analysis of each observation's "outlyingness" score. Judging solely from the 3D plots, however, the outliers coloured in red mostly fall outside the main cluster of points in the principal component space, so some degree of outlyingness could be inferred for them. That said, there are also a number of observations lying far from the main cluster that are not labelled as outliers in the original class labels, so I would not be confident defining outliers based solely on the PCA visualisation. On the other hand, no labelled outliers appear inside the main cluster of observations, which suggests the PCA representation does separate the outliers to some degree.

How about in a supervised way?

Looking at the class labels in the 3D plots, I believe the data could be used for supervised learning. With the labels available, it is clear that most of the red observations lie outside the main cluster of data. Some observations far from the cluster are marked as inliers, but there is still a decent number of outliers clustered together. Depending on the type of classification algorithm used and the sensitivity requirements around classification, thresholds (for example, on the predicted probability in logistic regression) could potentially be used to help classify the other observations that also appear to be outliers in the 3D plots.
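As a hypothetical illustration of the thresholding idea mentioned above (toy one-dimensional data, not the Stamps set; the names `fit` and `p` are made up for this sketch):

```r
# Fit a logistic regression on toy data with a minority "outlier" class,
# then vary the probability threshold used to flag outliers.
set.seed(42)
x <- c(rnorm(50, mean = 0), rnorm(10, mean = 3))  # 50 inliers, 10 outliers
y <- c(rep(0, 50), rep(1, 10))
fit <- glm(y ~ x, family = binomial)
p <- predict(fit, type = "response")

# a stricter (higher) threshold flags fewer observations as outliers
c(flagged_at_0.5 = sum(p > 0.5), flagged_at_0.2 = sum(p > 0.2))
```

Lowering the threshold trades false positives for a higher detection rate, which is the kind of sensitivity tuning a supervised approach would allow here.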

Activity 2: Unsupervised outlier detection

In this second activity, you are asked to perform unsupervised outlier detection on the Stamps data in the 9-dimensional space of the numerical predictors ( PB_Predictors ), using KNN Outlier with different values of the parameter k (at least the following three: k = 5, 25, 100). For each k, produce the same 3D PCA visualisation of the data as in Activity 1 (PCA), but rather than using the class labels to colour the points, use instead the resulting KNN Outlier Scores as a continuous, diverging colour scale. Then, for each k, produce a second plot where the top-31 outliers according to the KNN Outlier Scores are shown in red, while the other points are shown in black. Do these plots give you any insights on the values of k that look more or less appropriate from an unsupervised perspective (ignoring the class labels)? Justify your answer.
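Before the code, it may help to recall what the KNN Outlier score is: the distance from each observation to its k-th nearest neighbour, so isolated points receive large scores. A base-R sketch on toy data (the helper `knn_outlier_score` is made up for illustration; the actual analysis below uses `kNNdist()` from dbscan):

```r
# KNN Outlier score: distance from each point to its k-th nearest neighbour.
knn_outlier_score <- function(X, k) {
  D <- as.matrix(dist(X))
  # sort(d)[1] is the self-distance (0), so the k-th neighbour is sort(d)[k + 1]
  apply(D, 1, function(d) sort(d)[k + 1])
}

set.seed(42)
X <- rbind(matrix(rnorm(40, sd = 0.3), ncol = 2),  # a tight cluster of 20 points
           c(5, 5))                                # plus one isolated point
scores <- knn_outlier_score(X, k = 3)
which.max(scores)  # the isolated point (row 21) gets the largest score
```

Points in dense regions have nearby k-th neighbours and hence small scores, which is why the top-ranked scores are treated as outlier candidates below.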

# Activity 2

# intentionally do not use a loop so that the plots can be displayed in different tabs in the RMD file

# set k

k <- c(5,25,100)

# unsupervised outlier detection using knndist() 

# k = 5

KNN_Outlier.k5 <- kNNdist(x=PB_Predictors, k = k[1])[,k[1]] # KNN distance (outlier score) computation


# sort & display top 31 based on outlier scores

top_n <- 31 # No. of top outliers to be displayed

rank_KNN_Outlier.k5 <- order(x=KNN_Outlier.k5, decreasing = TRUE) # Sorting (descending)
KNN_Result.k5 <- data.frame(ID = rank_KNN_Outlier.k5, score = KNN_Outlier.k5[rank_KNN_Outlier.k5])
head(KNN_Result.k5, top_n)
##     ID     score
## 1  150 1.0831392
## 2  271 0.9452268
## 3   22 0.8701538
## 4    2 0.8176631
## 5  130 0.7382603
## 6  328 0.6434083
## 7   19 0.6400374
## 8  199 0.4853853
## 9  197 0.4831582
## 10  88 0.4626465
## 11 116 0.4620797
## 12 191 0.4619196
## 13 276 0.4239779
## 14  49 0.4228653
## 15 186 0.4188267
## 16 273 0.4150218
## 17 188 0.3989988
## 18 102 0.3716263
## 19 149 0.3706868
## 20 238 0.3685038
## 21 297 0.3653524
## 22   3 0.3650678
## 23 169 0.3559213
## 24 267 0.3557550
## 25  12 0.3552236
## 26   4 0.3532579
## 27 263 0.3448771
## 28 136 0.3406615
## 29 171 0.3377005
## 30  23 0.3282753
## 31 178 0.3199439
# k = 25

KNN_Outlier.k25 <- kNNdist(x=PB_Predictors, k = k[2])[,k[2]] # KNN distance (outlier score) computation

# sort & display top 31 based on outlier scores

rank_KNN_Outlier.k25 <- order(x=KNN_Outlier.k25, decreasing = TRUE) # Sorting (descending)
KNN_Result.k25 <- data.frame(ID = rank_KNN_Outlier.k25, score = KNN_Outlier.k25[rank_KNN_Outlier.k25])
head(KNN_Result.k25, top_n)
##     ID     score
## 1  150 1.2963491
## 2  271 1.0615617
## 3   22 1.0096438
## 4    2 0.9969271
## 5  328 0.9158535
## 6  130 0.8550355
## 7   19 0.7513259
## 8  199 0.5966591
## 9  276 0.5925294
## 10 197 0.5920398
## 11  88 0.5896104
## 12 171 0.5894291
## 13 238 0.5829931
## 14 169 0.5788086
## 15 116 0.5686392
## 16 179 0.5632258
## 17 273 0.5522833
## 18 198 0.5520698
## 19 334 0.5477680
## 20  24 0.5399514
## 21  25 0.5319735
## 22 191 0.5314182
## 23 263 0.5119748
## 24 267 0.5098777
## 25  49 0.5092998
## 26 186 0.4998188
## 27  26 0.4942470
## 28 149 0.4910287
## 29  16 0.4909396
## 30 188 0.4832703
## 31 316 0.4763570
# k = 100

KNN_Outlier.k100 <- kNNdist(x=PB_Predictors, k = k[3])[,k[3]] # KNN distance (outlier score) computation

# sort & display top 31 based on outlier scores

rank_KNN_Outlier.k100 <- order(x=KNN_Outlier.k100, decreasing = TRUE) # Sorting (descending)
KNN_Result.k100 <- data.frame(ID = rank_KNN_Outlier.k100, score = KNN_Outlier.k100[rank_KNN_Outlier.k100])
head(KNN_Result.k100, top_n)
##     ID     score
## 1  150 1.5027937
## 2  271 1.1867804
## 3   22 1.1317200
## 4    2 1.1002580
## 5  328 1.0519702
## 6  130 1.0366054
## 7   19 0.8689258
## 8  238 0.8155971
## 9  169 0.8117624
## 10 171 0.8028156
## 11  88 0.7894132
## 12 334 0.7701768
## 13 179 0.7664834
## 14  16 0.7528911
## 15 199 0.7516090
## 16 116 0.7475236
## 17 198 0.7337230
## 18 276 0.7317956
## 19 149 0.7125346
## 20  25 0.7076002
## 21  24 0.7062274
## 22 197 0.7046969
## 23 316 0.6814831
## 24  14 0.6769145
## 25  18 0.6761360
## 26  12 0.6750803
## 27 273 0.6735271
## 28  26 0.6709476
## 29 267 0.6708538
## 30 191 0.6583719
## 31  49 0.6572741
# Plot PCA - K = 5

plot.3d.knndist.k5 <- plot_ly(PCA.subset, x = ~PC1, y = ~PC2, z = ~PC3, color = KNN_Outlier.k5,
                           marker = list(size = 4)) %>%
                           layout(title = "PCA ranked by KNN Outlier score - K = 5")

# Plot PCA - K = 25

plot.3d.knndist.k25 <- plot_ly(PCA.subset, x = ~PC1, y = ~PC2, z = ~PC3, color = KNN_Outlier.k25,
                              marker = list(size = 4)) %>%
                              layout(title = "PCA ranked by KNN Outlier score - K = 25")

# Plot PCA - K = 100

plot.3d.knndist.k100 <- plot_ly(PCA.subset, x = ~PC1, y = ~PC2, z = ~PC3, color = KNN_Outlier.k100,
                               marker = list(size = 4)) %>%
                               layout(title = "PCA ranked by KNN Outlier score - K = 100")


# PCA with Top31 Outlier scores colour coded black & red

# create top 31 results for each k

k5.t31 <- head(KNN_Result.k5, top_n)
k25.t31 <- head(KNN_Result.k25, top_n)
k100.t31 <- head(KNN_Result.k100, top_n)


# bind together PCA data and KNN data for plotting for each k
PCA.k5 <- cbind(PCA.subset, KNN_Result.k5)
PCA.k25 <- cbind(PCA.subset, KNN_Result.k25)
PCA.k100 <- cbind(PCA.subset, KNN_Result.k100)

# left outer join/merge PCA & KNN results with top31 results to create a dataframe(s) that can be used for plotting

PCA.k5.t31 <- merge(PCA.k5, k5.t31 , by = "ID", all.x = TRUE)
PCA.k25.t31 <- merge(PCA.k25, k25.t31 , by = "ID", all.x = TRUE)
PCA.k100.t31 <- merge(PCA.k100, k100.t31 , by = "ID", all.x = TRUE)

# add column for colouring of plots
# k5
PCA.k5.t31 <- PCA.k5.t31 %>% 
  mutate(class = case_when(is.na(score.y) ~ "PCA Obs",
                           score.y > 0 ~ "Top 31 KNN Outlier"))

# k25
PCA.k25.t31 <- PCA.k25.t31 %>% 
  mutate(class = case_when(is.na(score.y) ~ "PCA Obs",
                           score.y > 0 ~ "Top 31 KNN Outlier"))

# k100
PCA.k100.t31 <- PCA.k100.t31 %>% 
  mutate(class = case_when(is.na(score.y) ~ "PCA Obs",
                           score.y > 0 ~ "Top 31 KNN Outlier"))


# Plot the top 31 KNN Outlier's against original PCA data

# k5

plot.3d.knndist.k5.t31 <- plot_ly(PCA.subset, x = ~PC1, y = ~PC2, z = ~PC3, color = ~PCA.k5.t31$class, 
                                  text = rownames(PCA.k5.t31),
                                  colors = c("black", "red"),
                                  marker = list(size = 4)) %>%
  layout(title = "PCA with Top 31 KNN Outliers Ranked on Score - K = 5")


# k25

plot.3d.knndist.k25.t31 <- plot_ly(PCA.subset, x = ~PC1, y = ~PC2, z = ~PC3, color = ~PCA.k25.t31$class,
                                  text = rownames(PCA.k25.t31),
                                  colors = c("black", "red"),
                                  marker = list(size = 4)) %>%
  layout(title = "PCA with Top 31 KNN Outliers Ranked on Score - K = 25")

# k100

plot.3d.knndist.k100.t31 <- plot_ly(PCA.subset, x = ~PC1, y = ~PC2, z = ~PC3, color = ~PCA.k100.t31$class,
                                  text = rownames(PCA.k100.t31),
                                  colors = c("black", "red"),
                                  marker = list(size = 4)) %>%
  layout(title = "PCA with Top 31 KNN Outliers Ranked on Score - K = 100")

# Plots

KNN Outlier scores on the PCA plots - continuous, diverging colour scale

Plot K = 5

plot.3d.knndist.k5
## No trace type specified:
##   Based on info supplied, a 'scatter3d' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#scatter3d
## No scatter3d mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode

Plot K = 25

plot.3d.knndist.k25
## No trace type specified:
##   Based on info supplied, a 'scatter3d' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#scatter3d
## No scatter3d mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode

Plot K = 100

plot.3d.knndist.k100
## No trace type specified:
##   Based on info supplied, a 'scatter3d' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#scatter3d
## No scatter3d mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode

Top 31 outliers according to KNN Outlier Score

Plot K = 5

plot.3d.knndist.k5.t31
## No trace type specified:
##   Based on info supplied, a 'scatter3d' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#scatter3d
## No scatter3d mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode

Plot K = 25

plot.3d.knndist.k25.t31
## No trace type specified:
##   Based on info supplied, a 'scatter3d' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#scatter3d
## No scatter3d mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode

Plot K = 100

plot.3d.knndist.k100.t31
## No trace type specified:
##   Based on info supplied, a 'scatter3d' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#scatter3d
## No scatter3d mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode

Do these plots give you any insights on the values of k that look more or less appropriate from an unsupervised perspective (ignoring the class labels)?

Looking at the three plots with the continuous colour scale, it is clear that only a small number of observations have a high outlier score: few points take on the yellow end of the scale in any of the three plots. A large cluster of dark blue points is visible, and as observations move further from that cluster their outlier score increases, which is reflected in the colour scale. The choice of k does not seem to have much impact on the clustering and outlier detection, as the three plots are very similar. The scores themselves increase slightly with k (they are largest for k = 100), but the plots are not really affected.

Analysing the top-31 plots, the picture is similar: the different k values barely change which points are flagged. If I had to use this method in an unsupervised way I would keep k = 5, or potentially do some more analysis to see whether k could be reduced even further without overfitting the model.

Activity 3: Supervised anomaly detection

1. Perform supervised classification of the Stamps data, using a KNN classifier with the same values of k as used in Activity 2 (unsupervised outlier detection). For each classifier (that is, each value of k), compute the Area Under the ROC Curve (AUC-ROC) in a Leave-One-Out Cross-Validation (LOOCV) scheme.

# 1

# Function to create ROC plot

rocplot <- function(pred, truth){
  predobj <- prediction(pred, truth)
  ROC     <- performance(predobj, "tpr", "fpr")
  # Plot the ROC Curve
  plot(ROC)
  auc     <- performance(predobj, measure = "auc")
  auc     <- auc@y.values[[1]]
  # Return the Area Under the Curve ROC
  return(auc)
}
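As a quick sanity check of the AUC computation used inside rocplot(), a perfectly ranked toy example should give an AUC of exactly 1 (the toy scores and labels here are invented for this check):

```r
library(ROCR)

# four observations, with both positives scored above both negatives
toy_scores <- c(0.1, 0.2, 0.8, 0.9)
toy_labels <- c("no", "no", "yes", "yes")
toy_pred <- prediction(toy_scores, toy_labels)
toy_auc <- performance(toy_pred, measure = "auc")@y.values[[1]]
toy_auc  # 1
```

ROCR treats the second factor level ("yes") as the positive class here, which matches the coding used for the Stamps labels.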

# separate class labels from predictors

predictors <- stamps[c("V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9")]
class_labels <- stamps[, "V10"]

# Loop over each k value (5, 25, 100) and plot the LOOCV ROC curve

for (i in 1:3){
  Pred_class <- knn.cv(train = predictors, cl = class_labels, k = k[i], prob = TRUE)
  Pred_prob <- attr(Pred_class, "prob")
  # knn.cv returns the vote proportion of the *winning* class;
  # convert it to the probability of the positive class ('yes')
  Pred_prob <- ifelse(Pred_class == 'yes', Pred_prob, 1 - Pred_prob)
  AUC <- rocplot(pred = Pred_prob, truth = class_labels)
  abline(a = 0, b = 1)
  text(x = .40, y = .6, paste("k = ", k[i], ", AUC = ", round(AUC, 3), sep = ""))
}

2. Compare the resulting (supervised) KNN classification performance for each value of k against the classification performance obtained in an unsupervised way by the KNN Outlier method with the same value of k. Notice that, if we rescale the KNN Outlier Scores (obtained in Activity 2 (unsupervised outlier detection)) into the [0, 1] interval, these scores can be interpreted as outlier probabilities, which can then be compared with the class labels (ground truth) in PB_class to compute an AUC-ROC value. This way, for each value of k, the AUC-ROC of the supervised KNN classifier can be compared with the AUC-ROC of KNN Outlier as an unsupervised classifier. Compare the performances of the supervised versus unsupervised classifiers and discuss the results. For example, recalling that the supervised method makes use of the class labels, whereas the unsupervised method doesn’t, what can you conclude considering there are applications where class labels are not available?

# 2

kNN_scale.k5 <- (KNN_Outlier.k5 - min(KNN_Outlier.k5)) / (max(KNN_Outlier.k5) - min(KNN_Outlier.k5))
kNN_scale.k25 <- (KNN_Outlier.k25 - min(KNN_Outlier.k25)) / (max(KNN_Outlier.k25) - min(KNN_Outlier.k25))
kNN_scale.k100 <- (KNN_Outlier.k100 - min(KNN_Outlier.k100)) / (max(KNN_Outlier.k100) - min(KNN_Outlier.k100))
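The min-max rescaling used above maps any score vector onto [0, 1] without changing the ranking; a small sketch with a made-up helper `minmax()`:

```r
# min-max rescaling: 0 at the smallest score, 1 at the largest,
# with the order of observations preserved
minmax <- function(x) (x - min(x)) / (max(x) - min(x))

s <- c(2, 5, 3, 11)
minmax(s)                              # 0 at s = 2, 1 at s = 11
range(minmax(s))                       # 0 1
identical(order(s), order(minmax(s)))  # TRUE: ranking unchanged
```

Because the ranking is preserved, the AUC-ROC of the rescaled scores equals that of the raw KNN Outlier scores; the rescaling only gives them a probability-like interpretation.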


# Compare supervised against unsupervised and plot K = 5

# recompute the LOOCV predictions for k = 5: the loop above overwrote
# Pred_prob, leaving only the values for the last k (k = 100)

Pred_class.k5 <- knn.cv(train = predictors, cl = class_labels, k = k[1], prob = TRUE)
Pred_prob.k5 <- attr(Pred_class.k5, "prob")
Pred_prob.k5 <- ifelse(Pred_class.k5 == 'yes', Pred_prob.k5, 1 - Pred_prob.k5)

# create 2 prediction objects for comparison: supervised (knn.cv) & unsupervised (outlier score)

supervised.pred.k5 <- prediction(Pred_prob.k5, class_labels)
unsupervised.pred.k5 <- prediction(kNN_scale.k5, class_labels)

# create 2 performance objects for comparison

supervised.perf.k5 <- performance(supervised.pred.k5, "tpr", "fpr")
unsupervised.perf.k5 <- performance(unsupervised.pred.k5, "tpr", "fpr")

# get AUC for k = 5 (outlier score)

auc.k5 <- performance(unsupervised.pred.k5, measure = "auc")
auc.k5 <- auc.k5@y.values[[1]]


# Compare supervised against unsupervised and plot K = 25

# recompute the LOOCV predictions for k = 25: the loop above overwrote
# Pred_prob, leaving only the values for the last k (k = 100)

Pred_class.k25 <- knn.cv(train = predictors, cl = class_labels, k = k[2], prob = TRUE)
Pred_prob.k25 <- attr(Pred_class.k25, "prob")
Pred_prob.k25 <- ifelse(Pred_class.k25 == 'yes', Pred_prob.k25, 1 - Pred_prob.k25)

# create 2 prediction objects for comparison: supervised (knn.cv) & unsupervised (outlier score)

supervised.pred.k25 <- prediction(Pred_prob.k25, class_labels)
unsupervised.pred.k25 <- prediction(kNN_scale.k25, class_labels)

# create 2 performance objects for comparison

supervised.perf.k25 <- performance(supervised.pred.k25, "tpr", "fpr")
unsupervised.perf.k25 <- performance(unsupervised.pred.k25, "tpr", "fpr")

# get AUC for k = 25 (outlier score)

auc.k25 <- performance(unsupervised.pred.k25, measure = "auc")
auc.k25 <- auc.k25@y.values[[1]]


# Compare supervised against unsupervised and plot K = 100

# recompute the LOOCV predictions for k = 100 explicitly (the loop's final
# iteration happens to leave these in Pred_prob, but recomputing keeps
# each block self-contained)

Pred_class.k100 <- knn.cv(train = predictors, cl = class_labels, k = k[3], prob = TRUE)
Pred_prob.k100 <- attr(Pred_class.k100, "prob")
Pred_prob.k100 <- ifelse(Pred_class.k100 == 'yes', Pred_prob.k100, 1 - Pred_prob.k100)

# create 2 prediction objects for comparison: supervised (knn.cv) & unsupervised (outlier score)

supervised.pred.k100 <- prediction(Pred_prob.k100, class_labels)
unsupervised.pred.k100 <- prediction(kNN_scale.k100, class_labels)

# create 2 performance objects for comparison

supervised.perf.k100 <- performance(supervised.pred.k100, "tpr", "fpr")
unsupervised.perf.k100 <- performance(unsupervised.pred.k100, "tpr", "fpr")

# get AUC for k = 100 (outlier score)

auc.k100 <- performance(unsupervised.pred.k100, measure = "auc")
auc.k100 <- auc.k100@y.values[[1]]

AUC-ROC Supervised Vs Unsupervised

K = 5

# plot for comparison; the supervised AUC is computed from the supervised
# prediction object rather than reusing the unsupervised auc.k5 for both labels

auc.sup.k5 <- performance(supervised.pred.k5, measure = "auc")@y.values[[1]]
plot(supervised.perf.k5, col = "dark blue", main = "Supervised vs Unsupervised K = 5")
plot(unsupervised.perf.k5, add = TRUE, col = "dark green")
legend("topleft", legend = c("Supervised", "Unsupervised"),
       col = c("dark blue", "dark green"), lty = 1, cex = 0.8)
abline(a = 0, b = 1)
text(x = .80, y = .27, paste("AUC Supervised = ", round(auc.sup.k5, 3), sep = ""), col = "dark blue")
text(x = .79, y = .21, paste("AUC Unsupervised = ", round(auc.k5, 3), sep = ""), col = "dark green")

K = 25

# plot for comparison; the supervised AUC is computed from the supervised
# prediction object rather than reusing the unsupervised auc.k25 for both labels

auc.sup.k25 <- performance(supervised.pred.k25, measure = "auc")@y.values[[1]]
plot(supervised.perf.k25, col = "dark blue", main = "Supervised vs Unsupervised K = 25")
plot(unsupervised.perf.k25, add = TRUE, col = "dark green")
legend("topleft", legend = c("Supervised", "Unsupervised"),
       col = c("dark blue", "dark green"), lty = 1, cex = 0.8)
abline(a = 0, b = 1)
text(x = .80, y = .27, paste("AUC Supervised = ", round(auc.sup.k25, 3), sep = ""), col = "dark blue")
text(x = .79, y = .21, paste("AUC Unsupervised = ", round(auc.k25, 3), sep = ""), col = "dark green")

K = 100

# plot for comparison; the supervised AUC is computed from the supervised
# prediction object rather than reusing the unsupervised auc.k100 for both labels

auc.sup.k100 <- performance(supervised.pred.k100, measure = "auc")@y.values[[1]]
plot(supervised.perf.k100, col = "dark blue", main = "Supervised vs Unsupervised K = 100")
plot(unsupervised.perf.k100, add = TRUE, col = "dark green")
legend("topleft", legend = c("Supervised", "Unsupervised"),
       col = c("dark blue", "dark green"), lty = 1, cex = 0.8)
abline(a = 0, b = 1)
text(x = .80, y = .27, paste("AUC Supervised = ", round(auc.sup.k100, 3), sep = ""), col = "dark blue")
text(x = .79, y = .21, paste("AUC Unsupervised = ", round(auc.k100, 3), sep = ""), col = "dark green")

Compare the performances of the supervised versus unsupervised classifiers and discuss the results. For example, recalling that the supervised method makes use of the class labels, whereas the unsupervised method doesn’t, what can you conclude considering there are applications where class labels are not available?

The first difference between the plots is that the AUC increases as k increases. As mentioned above, this is to be expected, and further analysis of the KNN Outlier logic would be needed to understand whether the model is overfitting and what the most efficient value of k is. As the AUC increases, the model's predictions improve, with the true positive rate rising more steeply at low false positive rates. Since it is not always possible to perform supervised learning, we can conclude that, when class labels are unavailable, a deeper analysis of the right value of k should be undertaken to guard against overfitting. Looking solely at the plots, as k increases so do the AUC and the true positive rate, while the behaviour of the false positive rate changes little across the plots.